Jan Leike

Tony

Excess Description Length of Learning Generalizable Predictors

Jan 08, 2026
Via arXiv

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Jan 08, 2026
Via arXiv

Unsupervised Elicitation of Language Models

Jun 11, 2025
Via arXiv

Reasoning Models Don't Always Say What They Think

May 08, 2025
Via arXiv

Auditing language models for hidden objectives

Mar 14, 2025
Via arXiv

Forecasting Rare Language Model Behaviors

Feb 24, 2025
Via arXiv

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025
Via arXiv

GPT-4o System Card

Oct 25, 2024
Via arXiv

Prover-Verifier Games improve legibility of LLM outputs

Jul 18, 2024
Via arXiv

LLM Critics Help Catch LLM Bugs

Jun 28, 2024
Via arXiv